Correlation and Regression

Matthew Talluto

22.02.2022

Covariance

Two random variables are independent when knowing one tells us nothing about the other:

\[ pr(y|x) = pr(y) \]

We can use the plot function to make a bivariate scatterplot in order to visualise this:

x = rnorm(1000)
y = rnorm(1000) # these variables are independent

plot(x, y, pch=16, bty='n')
Independent random variables


Covariance

\[ \mathrm{cov}_{xy} = \frac{\sum_{i=1}^n \left (x_i - \bar{x} \right) \left(y_i - \bar{y} \right)}{n-1} \]

# in R:
c(  top =    cov(x1, y1),
    middle = cov(x2, y2),
    bottom = cov(x3, y3))
##      top   middle   bottom 
##  0.80012 -0.41042 -0.00761
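
The x1/y1 through x3/y3 pairs refer to the panels of the accompanying figure. The formula itself can be checked by hand against R's built-in cov(); a minimal sketch with simulated data:

```r
# Check the covariance formula against R's built-in cov()
set.seed(42)
x = rnorm(100)
y = 0.5 * x + rnorm(100)  # positively related variables

# covariance by hand: sum of cross-products over n - 1
n = length(x)
cov_manual = sum((x - mean(x)) * (y - mean(y))) / (n - 1)

all.equal(cov_manual, cov(x, y))  # TRUE
```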

Correlation

\[ \begin{aligned} \mathrm{cov}_{xy} & = \frac{\sum_{i=1}^n \left (x_i - \bar{x} \right) \left(y_i - \bar{y} \right)}{n-1} \\ r_{xy} & = \frac{\mathrm{cov}_{xy}}{s_x s_y} \end{aligned} \]

# in R:
c(  top =    cor(x1, y1),
    middle = cor(x2, y2),
    bottom = cor(x3, y3))
##      top   middle   bottom 
##  0.89570 -0.39786 -0.00738
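
As a quick check, r is just the covariance rescaled by the two standard deviations; a sketch with simulated data:

```r
# Correlation is covariance scaled by the standard deviations
set.seed(1)
x = rnorm(100)
y = -0.3 * x + rnorm(100)

r_manual = cov(x, y) / (sd(x) * sd(y))
all.equal(r_manual, cor(x, y))  # TRUE
```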

Correlation significance testing

\(H_0\): \(r = 0\)

\(H_A\): two sided (\(\rho \ne 0\)) or one-sided (\(\rho > 0\) or \(\rho < 0\))

\(r\) has a standard error:

\[ s_{r} = \sqrt{\frac{1-r^2}{n-2}} \]

We can then compute a \(t\)-statistic:

\[ t = \frac{r}{s_r} \]

The p-value is the probability, under \(H_0\), of observing a \(t\)-statistic at least this extreme; it is computed from the CDF of the \(t\) distribution with \(n-2\) degrees of freedom.

Correlation test in R

\[ \begin{aligned} s_{r} &= \sqrt{\frac{1-r^2}{n-2}} \\ t &= \frac{r}{s_r} \end{aligned} \]

data(iris)
iris_set = subset(iris, Species == "setosa")
n = nrow(iris_set)
r = cor(iris_set$Sepal.Length, iris_set$Petal.Length)
r
## [1] 0.267
s_r = sqrt((1-r^2)/(n-2))
s_r
## [1] 0.139
t_val = r/s_r
t_val
## [1] 1.92
2 * pt(t_val, n-2, lower.tail = FALSE) ## two-sided test
## [1] 0.0607

Visualisation

The classic way to visualise is a scatterplot

setosa_plot = ggplot(iris_set, aes(x = Sepal.Length, y = Petal.Length)) + 
    geom_point() + xlab("Setosa Sepal Length") + ylab("Setosa Petal Length")
setosa_plot

# equivalent in base graphics
# plot(Petal.Length ~ Sepal.Length, data = iris_set, pch=16, 
#     xlab = "Setosa Sepal Length", ylab = "Setosa Petal Length", bty='n')

Correlation test in R

# Equivalent built-in test; the same test can also be
# specified with a formula:
# cor.test(~ Sepal.Length + Petal.Length, data = iris_set, alternative = "two.sided")
with(iris_set, 
     cor.test(Sepal.Length, Petal.Length, alternative = "two.sided"))
## 
##  Pearson's product-moment correlation
## 
## data:  Sepal.Length and Petal.Length
## t = 2, df = 48, p-value = 0.06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.0121  0.5078
## sample estimates:
##   cor 
## 0.267

Correlation test: assumptions

Correlation pitfalls


Heterogeneity of subgroups

Spearman correlation

cor.test(x, y)
## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = -1, df = 148, p-value = 0.3
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2447  0.0734
## sample estimates:
##     cor 
## -0.0879

cor.test(x, y, method = 'spearman')
## 
##  Spearman's rank correlation rho
## 
## data:  x and y
## S = 8e+05, p-value = 9e-06
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##    rho 
## -0.357
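
Spearman's rho is simply Pearson's r computed on the ranks of the data, which is what makes it robust to nonlinear (but monotonic) relationships; a sketch with simulated data:

```r
# Spearman's rho is Pearson's r applied to the ranks
set.seed(5)
x = runif(50)
y = exp(3 * x) + rnorm(50, sd = 0.5)  # monotonic but nonlinear

rho = cor(x, y, method = "spearman")
r_of_ranks = cor(rank(x), rank(y))
all.equal(rho, r_of_ranks)  # TRUE
```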

Tests of association: 2 × 1 Tables

Mendel’s peas

Observation:

colour F2
violet 705
white 224



\(H_0\): Inheritance is Mendelian (violet:white = 3:1)

\(H_A\): Inheritance is not exactly Mendelian

# Test of proportions against a null hypothesis
# For small sample sizes, use binom.test
counts = matrix(c(705, 224), ncol = 2)
prop.test(counts, p = 0.75, alternative = "two.sided")
## 
##  1-sample proportions test with continuity correction
## 
## data:  counts, null probability 0.75
## X-squared = 0.3, df = 1, p-value = 0.6
## alternative hypothesis: true p is not equal to 0.75
## 95 percent confidence interval:
##  0.730 0.786
## sample estimates:
##     p 
## 0.759
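
The same 3:1 null ratio can equivalently be tested with a chi-squared goodness-of-fit test, which also generalises to more than two categories; a sketch using the counts above:

```r
# Chi-squared goodness-of-fit test against the 3:1 Mendelian ratio
counts = c(violet = 705, white = 224)
res = chisq.test(counts, p = c(0.75, 0.25))
res
```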

Tests of association: n × n Tables

Nesting holes of black-backed woodpeckers.

woodpecker = read.csv("data/woodpecker.csv")
head(woodpecker)
##   forest_type nest_tree
## 1      burned     birch
## 2      burned     birch
## 3      burned jack pine
## 4      burned     aspen
## 5      burned     birch
## 6      burned jack pine

table(woodpecker)
##            nest_tree
## forest_type aspen birch jack pine
##    burned       6    16         2
##    unburned    29    18        23

We want to test for an association between the two variables (forest type and nest tree)

\(H_0\): Nesting tree is not associated with forest type

\(H_A\): Nest tree is associated with forest type

Chi-squared test

Nesting holes of black-backed woodpeckers.

table(woodpecker)
##            nest_tree
## forest_type aspen birch jack pine
##    burned       6    16         2
##    unburned    29    18        23

table(woodpecker)/rowSums(table(woodpecker))
# equivalently: prop.table(table(woodpecker), margin = 1)
##            nest_tree
## forest_type  aspen  birch jack pine
##    burned   0.2500 0.6667    0.0833
##    unburned 0.4143 0.2571    0.3286


# for 2x2 tables with small sample sizes: Fisher's exact test
# fisher.test()
with(woodpecker, chisq.test(forest_type, nest_tree))
## 
##  Pearson's Chi-squared test
## 
## data:  forest_type and nest_tree
## X-squared = 14, df = 2, p-value = 0.001
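
The chi-squared approximation relies on the expected counts being reasonably large (a common rule of thumb: all at least 5); these can be inspected from the fitted test object. A sketch that rebuilds the table above by hand so it runs without the data file:

```r
# Rebuild the contingency table and inspect the expected counts
tab = matrix(c(6, 29, 16, 18, 2, 23), nrow = 2,
             dimnames = list(forest_type = c("burned", "unburned"),
                             nest_tree = c("aspen", "birch", "jack pine")))
res = chisq.test(tab)
res$expected  # all expected counts are at least 5 here
```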

Visualisation

Plots in ggplot have 3 required components: data, aesthetic mappings (aes), and at least one geometry (geom).

data(iris)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Visualisation

Plots in ggplot have 3 required components.

library(ggplot2)
ggplot(iris, aes(x = Petal.Width, y = Petal.Length))

Visualisation

Plots in ggplot have 3 required components.

iris_plot = ggplot(iris, aes(x = Petal.Width, 
                y = Petal.Length)) + geom_point()
iris_plot

Visualisation: Ordinal Data

Scatterplots become less useful.

birddiv = read.csv("data/birddiv.csv")
bird_plot = ggplot(birddiv, aes(x=forest_frag, 
                y = richness, colour = bird_type)) + 
                geom_point()
head(birddiv)
##   Grow.degd For.cover  NDVI bird_type richness forest_frag
## 1       330      99.9 60.38    forest        8           1
## 2       330       0.0 22.88    forest        1           0
## 3       330      38.3 11.86    forest        5           3
## 4       330      11.4 19.07    forest        7           7
## 5       330       0.0  2.12    forest        2           0
## 6       170     100.0 54.03    forest        7           1

Visualisation: Ordinal Data

Adding jitter can sometimes improve things

bird_plot = ggplot(birddiv, aes(x=forest_frag, 
                y = richness, colour = bird_type)) + 
                geom_jitter()
bird_plot

Visualisation: Ordinal Data

Better solution: boxplots

bird_plot = ggplot(birddiv, aes(x=as.factor(forest_frag), 
                y = richness, fill = bird_type)) + 
                geom_boxplot()
bird_plot = bird_plot + xlab("Forest Fragmentation")
bird_plot

Visualisation: Categorical Data

You see this a lot. When should you do it? NEVER

Visualisation: Categorical Data

This is almost as bad, and sadly much more common

Visualisation: Categorical Data

Barplots, or proportional bars for counts within categories

table(woodpecker)
##            nest_tree
## forest_type aspen birch jack pine
##    burned       6    16         2
##    unburned    29    18        23

woodp_plot = ggplot(woodpecker, aes(x = nest_tree,
                fill = forest_type))
woodp_plot = woodp_plot + geom_bar(width = 0.5)
woodp_plot

Visualisation: Categorical Data

Barplots, or proportional bars for counts within categories

table(woodpecker)
##            nest_tree
## forest_type aspen birch jack pine
##    burned       6    16         2
##    unburned    29    18        23

woodp_plot = ggplot(woodpecker, aes(x = nest_tree,
                fill = forest_type))
woodp_plot = woodp_plot + geom_bar(width = 0.5, 
                    position=position_dodge())
woodp_plot = woodp_plot + xlab("Nest Tree Type") +
    theme_minimal() + labs(fill = "Forest Type")
woodp_plot

The linear model

\[ \begin{aligned} \mathbb{E}(y|x) & = \hat{y} = \alpha + \beta x \\ & \approx a + bx \end{aligned} \]

\(y\) is not perfectly predicted by \(x\), so we must include an error (residual) term:

\[ \begin{aligned} y_i & = \hat{y_i} + \epsilon_i \\ & = a + bx_i + \epsilon_i\\ \\ \epsilon & \sim \mathcal{N}(0, s_{\epsilon}) \end{aligned} \]
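
The model is easy to simulate, which makes the role of the residual term concrete; a minimal sketch with arbitrary "true" parameter values:

```r
# Simulate data from the linear model y = a + b*x + error
set.seed(123)
a = 2; b = 0.5; s_eps = 1  # arbitrary 'true' parameters
x = runif(100, 0, 10)
y = a + b * x + rnorm(100, mean = 0, sd = s_eps)

plot(x, y, pch = 16, bty = 'n')
abline(a, b, lty = 2)  # the 'true' line, without the residual scatter
```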

Method of least squares

\[ \begin{aligned} \hat{y_i} & = a + b x_i \\ y_i & = \hat{y_i} + \epsilon_i \\ \epsilon & \sim \mathcal{N}(0, s_{\epsilon}) \end{aligned} \]

We want to solve these equations for \(a\) and \(b\)

The “best” \(a\) and \(b\) are the ones that draw a line that is closest to the most points (i.e., that minimizes \(s_{\epsilon}\))

Method of least squares

\[ s_{\epsilon} = \sqrt{\frac{\sum_{i=1}^n \left (y_i -\hat{y_i}\right )^2}{n-2}} \]

\[ \begin{aligned} \mathrm{ESS} & = \sum_{i=1}^n \left (y_i -\hat{y_i}\right )^2 \\ & = \sum_{i=1}^n \left (y_i - a - bx_i \right )^2 \end{aligned} \]

Ordinary least squares estimation

Solving for the minimum ESS yields:

\[ \begin{aligned} b & = \frac{\mathrm{cov_{xy}}}{s^2_x} \\ \\ & = r_{xy}\frac{s_y}{s_x}\\ \\ a & = \bar{y} - b\bar{x} \end{aligned} \]
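
These closed-form estimates can be checked against lm(); a sketch with simulated data:

```r
# Ordinary least squares by hand, checked against lm()
set.seed(1)
x = runif(50, 0, 10)
y = 2 + 0.5 * x + rnorm(50)

b = cov(x, y) / var(x)       # slope
a = mean(y) - b * mean(x)    # intercept

coef(lm(y ~ x))              # matches c(a, b)
```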

The parameters have standard errors:

\[ s_a = s_{\hat{y}}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n \left(x_i-\bar{x}\right)^2}} \]

\[ s_b = \frac{s_{\hat{y}}}{\sqrt{\sum_{i=1}^n \left(x_i-\bar{x}\right)^2}} \]
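
Both standard errors can likewise be computed by hand and compared with the lm() summary table; a sketch with simulated data:

```r
# Standard errors of a and b by hand, checked against summary(lm())
set.seed(1)
x = runif(50, 0, 10)
y = 2 + 0.5 * x + rnorm(50)
n = length(x)

mod = lm(y ~ x)
s_yhat = sqrt(sum(residuals(mod)^2) / (n - 2))  # residual standard error
ssx = sum((x - mean(x))^2)

s_a = s_yhat * sqrt(1/n + mean(x)^2 / ssx)
s_b = s_yhat / sqrt(ssx)

summary(mod)$coefficients[, "Std. Error"]  # matches c(s_a, s_b)
```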

Significance tests

\(\mathbf{H_0}\): The slope \(\beta\) = 0 (i.e., no variation in \(y\) is explained by variation in \(x\))

The ratio of variance explained to residual variance follows an \(F\) distribution

\[\begin{aligned} F & = \frac{MS_{model}}{MS_{err}} \\ MS_{model} & = \sum_{i=1}^n \left ( \hat{y}_i - \bar{y}\right)^2 \\ MS_{err} & = \frac{\sum_{i=1}^n \left ( y_i - \hat{y}_i \right)^2}{n-2} \end{aligned}\]
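
The F statistic can be reproduced by hand from a fitted model; a sketch with simulated data:

```r
# F statistic by hand, checked against the lm() summary
set.seed(2)
x = runif(50, 0, 10)
y = 2 + 0.5 * x + rnorm(50)
n = length(x)

mod = lm(y ~ x)
ms_model = sum((fitted(mod) - mean(y))^2)  # model df = 1
ms_err = sum(residuals(mod)^2) / (n - 2)   # residual df = n - 2

f_manual = ms_model / ms_err
summary(mod)$fstatistic["value"]  # matches f_manual
```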

Goodness of fit

The coefficient of determination tells you the proportion of variance explained by the model:

\[ r^2 = \frac{\sum_{i=1}^n \left ( \hat{y_i} - \bar{y} \right )^2} {\sum_{i=1}^n \left ( y_i - \bar{y} \right )^2} \]
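
The same quantity can be computed by hand from a fitted model and compared with the Multiple R-squared reported by summary(); a sketch:

```r
# Coefficient of determination by hand
set.seed(1)
x = runif(50, 0, 10)
y = 2 + 0.5 * x + rnorm(50)

mod = lm(y ~ x)
r2 = sum((fitted(mod) - mean(y))^2) / sum((y - mean(y))^2)

all.equal(r2, summary(mod)$r.squared)  # TRUE
```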

Regression assumptions

set.seed(123)
y = rpois(200, 7.5)
c(
  mean(log(y)),  # mean of the logs
  log(mean(y)))  # log of the mean: not the same thing
## [1] 1.96 2.02
y = c(y, 0)  # a single zero count breaks the log transformation
c(
  mean(log(y)),
  log(mean(y)))
## [1] -Inf 2.02

Transformations
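
One common response to the zero problem above is to use log(y + 1) (via log1p()); a sketch (note that the added constant is arbitrary and changes the interpretation of the transformed scale):

```r
# log1p(y) = log(1 + y) is defined at zero, where log(y) is not
set.seed(123)
y = c(rpois(200, 7.5), 0)

c(log = mean(log(y)),      # -Inf: log(0) is undefined
  log1p = mean(log1p(y)))  # finite
```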

Regression in R

## Data on sparrow wing lengths from Zar (1984)
dat = data.frame(age = c(3:6, 8:12, 14:17), 
        wing_length = c(1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2, 3.9, 4.1, 4.7, 4.5, 5.2, 5.0))
mod = lm(wing_length ~ age, data = dat)
summary(mod)
## 
## Call:
## lm(formula = wing_length ~ age, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3070 -0.2154  0.0655  0.1632  0.2251 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.7131     0.1479    4.82  0.00053 ***
## age           0.2702     0.0135   20.03  5.3e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.218 on 11 degrees of freedom
## Multiple R-squared:  0.973,  Adjusted R-squared:  0.971 
## F-statistic:  401 on 1 and 11 DF,  p-value: 5.27e-10
confint(mod)
##             2.5 % 97.5 %
## (Intercept) 0.388   1.04
## age         0.241   0.30

Regression in R: diagnostics

par(mfrow=c(1, 2), bty='n')

## scale(residuals) produces **standardized** residuals
qqnorm(scale(residuals(mod)), pch=16)
qqline(scale(residuals(mod)), lty=2)
plot(dat$age, residuals(mod), xlab = "age", ylab = "residuals", pch=16)
abline(h = 0, lty=2)


## also try
## plot(mod)

Presenting regression output

We found a significant positive relationship between wing length and age (\(F_{1,11} = 401\), \(p < 0.001\), \(R^2 = 0.97\); Table 1)

Table 1. Parameter estimates for regression of wing length on age
            estimate   st. error   95% CI
intercept   0.71       0.150       0.39, 1.03
age         0.27       0.013       0.24, 0.30